Vector datasets catalog and downloader #7446
Conversation
Merging this PR will not alter performance
We probably want to mirror this somewhere. @AdamGS @robert3005 is there an easy way to do this?

R2 is probably the easiest? Whatever we use for the clickbench data
```rust
/// Stream a large file to disk with a byte-progress bar.
async fn download_with_progress(client: &Client, url: &str, output: &PathBuf) -> Result<()> {
```
Why do we need another one of these? Can't this be part of the general download utils we have here?
(this function and a bunch of the following ones)
maybe? But afaict we dont use a reqwest client anywhere else
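For reference, the byte-progress part can be isolated from the network layer with std only: a chunked copy that reports cumulative bytes to a callback after each chunk. This is a minimal sketch, not the actual function in this PR — in the real `download_with_progress` the reader would be the response byte stream and the callback would advance a progress bar.

```rust
use std::io::{Read, Result, Write};

/// Chunked copy that reports cumulative bytes to `on_progress` after
/// each chunk. A std-only sketch of the byte-progress bookkeeping; a
/// real downloader would feed this from the HTTP byte stream and have
/// the callback drive a progress bar.
fn copy_with_progress<R: Read, W: Write>(
    mut reader: R,
    mut writer: W,
    mut on_progress: impl FnMut(u64),
) -> Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        writer.write_all(&buf[..n])?;
        total += n as u64;
        on_progress(total);
    }
    Ok(total)
}
```

Keeping the progress bookkeeping generic over `Read`/`Write` like this is one way the general download utils could share it without every caller needing a reqwest client.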
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
AdamGS left a comment
overall, nothing objectionable here, lets ship it
ok im yoloing this since this doesnt affect anyone else. At some point it would be good to unify the downloading, but if we are going to do that then we might as well implement the catalog idea that @joseph-isaacs had.

@joseph-isaacs what's the catalog idea? worth writing down somewhere?
## Summary

Tracking issue: #7297

Adds a TurboQuant demo where we convert the parquet files to a Vortex file (in-memory only for now, but still serialized as bytes), and then we verify by decoding and performing a basic cosine similarity expression search with a filter pushdown.

This is based on top of #7446, please don't merge until that has merged.

## Testing

The example runs!

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
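For context, the cosine similarity score the demo searches on is just the normalized dot product. A plain scalar sketch (the demo itself evaluates this as a pushed-down expression, not a scalar loop):

```rust
/// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
/// Scalar sketch of the scoring expression the demo evaluates.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is what the top-k neighbor lists in the datasets let us verify against.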

Summary
Tracking issue: #7297
We will want to add vector benchmarking soon (see #7399 for a draft).
This adds a simple catalog for the vector datasets hosted at https://assets.zilliz.com/benchmark for VectorDBBench, which describes the shape of each dataset (whether it is partitioned, whether it is randomly shuffled, whether there are neighbor lists for top-k, etc.) and also handles downloading everything.
I had to verify that all of this was correct by looking at the S3 buckets themselves.
And this script from the main repo helped too: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py
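As an illustration of the catalog shape described above, a minimal sketch of an entry might look like this — the field names, the example dataset name, and the URL layout are assumptions for illustration, not the actual code in this PR:

```rust
/// Hypothetical catalog entry describing one VectorDBBench dataset.
/// Field names are illustrative assumptions, not the PR's real types.
#[derive(Debug, Clone)]
pub struct VectorDataset {
    pub name: &'static str,
    pub dim: usize,
    pub partitioned: bool,
    pub shuffled: bool,
    /// Whether ground-truth neighbor lists for top-k are provided.
    pub has_neighbors: bool,
}

impl VectorDataset {
    /// Build the asset URL for one file of this dataset. The base URL
    /// is from the PR description; the path layout is an assumption.
    pub fn file_url(&self, file: &str) -> String {
        format!("https://assets.zilliz.com/benchmark/{}/{}", self.name, file)
    }
}
```

Describing the shape declaratively like this is what lets the downloader stay generic: it can enumerate files per entry instead of hard-coding per-dataset logic.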
Things that are not implemented that I would like to add:
Testing
N/A